perf: enable MoE GroupedGEMM for MoE models#2278
Conversation
Wires moe_grouped_gemm (CUTLASS grouped GEMM for MoE experts) through the MegatronConfig TypedDict and _apply_moe_config(). Enables it in every root MoE performance recipe (Qwen3-30B-A3B 4n4g/4n8g/4n8g-40K, Qwen3-235B 16n8g, DeepSeek-V3 32n8g, DAPO DeepSeek-V3 64n8g); child recipes inherit. Signed-off-by: sna <sna@nvidia.com>
f7bdb19 to
1713776
Compare
|
/ok to test 1713776 |
terrykong
left a comment
There was a problem hiding this comment.
Summary
This PR adds moe_grouped_gemm support through the MegatronConfig TypedDict and _apply_moe_config(), enabling it in 6 root MoE performance recipes. The concept is sound and the feature is valuable, but the implementation has a critical issue with the config application pattern that could silently regress throughput for ~14 other MoE recipes.
Ship blocker
Silent performance regression: The .get("moe_grouped_gemm", False) pattern unconditionally overwrites Bridge's default of True with False for any config that omits the field. See inline comment for details and suggested fix.
Suggestions
- Performance evidence: Could you share before/after throughput data (tokens/sec or step time) on at least one representative recipe? A short convergence sanity check would also be helpful.
- PR description: The "What does this PR do?", "Issues", and "Usage" sections still have template placeholders — consider filling them in.
Generated by Claude Code
Co-authored-by: Terry Kong <terryk@nvidia.com> Signed-off-by: Seonjin <sna@nvidia.com>
Co-authored-by: Terry Kong <terryk@nvidia.com> Signed-off-by: Seonjin <sna@nvidia.com>
|
/ok to test 2520299 |
|
/ok test 1cd8dd1 |
|
@terrykong I've made changes based on your suggestions. Does this PR look good to you? |
|
/ok to test ab2dacb |
|
/claude review |
|
/ok to test a9b5190 |
Support moe_grouped_gemm (grouped GEMM for MoE experts) through the MegatronConfig TypedDict and _apply_moe_config(). Enables it in every root MoE performance recipes (Qwen3-30B-A3B 4n4g/4n8g/4n8g-40K, Qwen3-235B 16n8g, DeepSeek-V3 32n8g, DAPO DeepSeek-V3 64n8g); child recipes inherit.
What does this PR do ?
Add a one line overview of what this PR aims to accomplish.
Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use thisBefore your PR is "Ready for review"
Pre checks:
Additional Information